Introduction to ggplot2
The Grammar of Graphics
ggplot2 is one of the core packages of the tidyverse. It is more flexible and versatile than the plotting functions in base R.
The “gg” stands for “Grammar of Graphics”, a book by Leland Wilkinson that offers tools to concisely describe the components of a graphic.
The logic of ggplot2 stems from the idea that you can build every graph from the same few components: a data set, visual marks (geoms) representing the data, and a coordinate system.
As Hadley Wickham explained: “You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.”
Grammatical elements of ggplot2
A key feature of ggplot2 is that it allows us to layer graphical elements on top of each other, creating elaborate visualizations. The grammatical elements are:
- Data - the data frame we want to use for our plot
- Aesthetics (aes) - the scales we want to map our data onto
- Geometrics (geom) - the geometrical shapes representing our data
- Themes - the appearance of the non-data aspects of the plot
- Statistics - statistical representations of the data
- Coordinates/Scales - the range and limits of our plot
- Facets - the layout of multiple plots and subplots
The first three elements: data, aesthetics (aes), and geometrics (geom), are the basic elements. We must define them in the ggplot function in order to produce a meaningful plot.
The remaining elements are “optional”, that is, they are set to a default. This means we are not required to define them when we plot, though typically we would want to adjust them.
In this presentation I will focus mainly on the first three and the most commonly used geoms.
Let's get to work!
Installing packages
Begin by installing and loading the tidyverse package, which includes ggplot2 along with other useful packages such as dplyr and tidyr, which are used for manipulating data prior to plotting.
You only need to install the package once, but you will need to “load” it every time you restart a session.
If you only want to install the ggplot2 package you can use a similar line of code, but you will most likely use dplyr as well, so you may as well install tidyverse, which includes both (and more).
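A minimal sketch of the install-and-load step described above. The requireNamespace() guard is our own addition for illustration, so that the install only runs when the package is missing.

```r
# Install once (skipped when tidyverse is already installed)
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Load at the start of every session
library(tidyverse)

# To install only ggplot2 instead, a similar line works:
# install.packages("ggplot2")
```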
Our Data
For this exercise we will use the diamonds dataset, which ships with ggplot2, and the gapminder dataset from the gapminder package. Both are available in R.
We will start working with the diamonds data.
The first step should always be to examine the dataset. What variables do we have? What data type is each variable? How many observations are included?
You can use the structure function str(), or the summary function summary() if you want more details on each variable.
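The output that follows was presumably produced by calls along these lines. The exact calls are an assumption, since the original code is not shown.

```r
library(tidyverse)  # loads ggplot2 (which provides diamonds), dplyr and tibble

diamonds           # print the first rows of the tibble
str(diamonds)      # compact structure: type and first values of each variable
summary(diamonds)  # summary statistics for each variable
```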
## # A tibble: 53,940 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # … with 53,930 more rows
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
If you only want to know the variable names, you can simply list the columns of the dataset using colnames().
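For the diamonds data, the call would look like this:

```r
library(ggplot2)  # provides the diamonds data

# List only the column (variable) names
colnames(diamonds)
```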
## [1] "carat" "cut" "color" "clarity" "depth" "table" "price"
## [8] "x" "y" "z"
Now let’s continue exploring the data by plotting it with ggplot2.
The ggplot2 syntax
The first line of code in ggplot2 requires us to input the data frame we intend to use, and the aesthetics we want to map our data onto. This line typically includes all the data needed for creating the plot. The function syntax is written as: ggplot(data, aes())
For instance, to plot the price of diamonds based on their carat we need to set “diamonds” as the data, and map “carat” and “price” onto the x and y aesthetics.
The function can be written either as: ggplot(data = diamonds, aes(x = carat, y = price)) or simply as: ggplot(diamonds, aes(carat, price))
This creates the base layer of our plot, which includes the dimensions we defined for the aesthetics. In order to present the observations, we need to add geometric layers. For every layer we add, we need to place a “+” sign.
For instance, to present a trend line of the average price by carat, we can add a geom_smooth() layer. This geom draws a smoothed regression line with a confidence interval.
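A sketch of the two steps just described: the base layer on its own, then the same base with a geom_smooth() layer added. The object names base and p_smooth are ours, chosen for illustration.

```r
library(ggplot2)

# Base layer: data and aesthetic mappings only (draws empty axes)
base <- ggplot(diamonds, aes(x = carat, y = price))
base

# Add a smoothed trend line with its confidence band
p_smooth <- base + geom_smooth()
p_smooth
```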
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
However, a trend line on its own does not tell us much about the individual observations. In this instance, it would make more sense to create a scatterplot in order to see the spread of the observations. We can do this by adding a geom_point() layer.
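Swapping the smooth layer for a point layer gives the scatterplot (p_points is an illustrative name):

```r
library(ggplot2)

# One point per diamond: carat on the x axis, price on the y axis
p_points <- ggplot(diamonds, aes(carat, price)) +
  geom_point()
p_points
```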
Scatterplots
Many of the observations are overlapping, making it difficult to see the actual distribution. To help remedy overplotting, we can adjust the transparency of the points by reducing the alpha and also increase the size of the points inside the geom_point layer.
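A sketch of this adjustment. The values alpha = 0.4 and size = 2 are the ones used in the later examples of this tutorial; other values may suit other data better.

```r
library(ggplot2)

# Fixed (non-mapped) settings go inside the geom, outside aes()
p_alpha <- ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.4, size = 2)
p_alpha
```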
This looks better, but it is still difficult to draw insights from this plot. We can add another aesthetic mapping to differentiate between diamonds with different cuts. In this example we will map “cut” onto the color aesthetic in the ggplot line.
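With the extra mapping in place, the plot call would look something like this (p_cut is an illustrative name):

```r
library(ggplot2)

# Mapping cut to color inside aes() gives each cut level its own color
p_cut <- ggplot(diamonds, aes(carat, price, color = cut)) +
  geom_point(alpha = 0.4, size = 2)
p_cut
```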
As mentioned previously, ggplot2 enables us to add multiple geom layers on top of each other. Each new geom layer will appear on top of the previous layers. And don’t forget to add another “+” sign.
ggplot(diamonds, aes(carat, price, color = cut)) +
  geom_point(alpha = 0.4, size = 2) +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
The aesthetics defined in the first line are automatically adopted by all the geom layers. Aesthetics defined in an individual geom layer affect only that geom layer, and can override aesthetic mappings from the main ggplot() line.
ggplot(diamonds, aes(carat, price, color = cut)) +
  geom_point(alpha = 0.4, size = 2) +
  geom_smooth(color = "deeppink3")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
We can add multiple geom layers of the same type:
ggplot(diamonds, aes(carat, price, color = cut)) +
  geom_point(alpha = 0.4, size = 2) +
  geom_smooth(color = "deeppink3", se = FALSE) +
  geom_smooth(color = "blue", method = lm) +
  ylim(0, 20000)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 38 rows containing missing values (geom_smooth).
Each geom type has multiple arguments which are set to default values, which we can easily change based on our needs. For instance, geom_point can take arguments relating to x, y, alpha, color, fill, shape, size, and stroke. In the previous examples we changed the alpha and size of the points.
For a cheat sheet of ggplot2 geom arguments by RStudio, visit this link.
Improving the plot
Before we continue, let’s make our lives a bit easier. Instead of typing the function over and over again, we can assign the plot to an object and simply add layers to that object.
## assigning a ggplot2 plot to an object
dd <- ggplot(data = diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point(alpha = 0.4, size = 2)
# same as: dd <- ggplot(diamonds, aes(carat, price, color = cut)) + geom_point(alpha = 0.4, size = 2)
Now we can add layers and adjustments to “dd”, which already contains our predefined ggplot() + geom_point().
Vertical lines
We can add lines to indicate the median and mean of carat. To add vertical and horizontal lines we use geom_vline() and geom_hline() respectively.
# 0.7 is the median and 0.7979 the mean of carat, taken from summary(diamonds)
dd +
  geom_vline(xintercept = 0.7, linetype = "dashed", color = "#b22222") +
  geom_vline(xintercept = 0.7979, linetype = "dashed", color = "turquoise4")
We can also add labels to the lines with geom_text() to indicate what they represent.
# note: constants inside aes() are drawn once per row of the data;
# annotate("text", ...) is an alternative that draws each label only once
dd +
  geom_vline(xintercept = 0.7, linetype = "dashed", color = "#b22222") +
  geom_vline(xintercept = 0.7979, linetype = "dashed", color = "turquoise4") +
  geom_text(aes(x = 0.7, label = "carat median", y = 14000), vjust = -1, color = "#b22222", angle = 90, size = 3) +
  geom_text(aes(x = 0.7979, label = "carat mean", y = 14000), vjust = 1, color = "turquoise4", angle = 90, size = 3)
Even though we improved the plot, we can see that much of the data is concentrated on the left side of the plot. We can handle this by adjusting the data or, better yet, adjusting the scale.
Adjusting the data
Using dplyr functions, we can filter out observations larger than 3 carats. We’ll create a new dataset by saving the filtered data into an object called “smallD”.
We then plot the same aesthetics using the new data frame “smallD”.
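A sketch of the filter-then-plot step. The condition carat <= 3 is our reading of "filter out observations greater than 3 carats".

```r
library(ggplot2)
library(dplyr)

# Keep only diamonds of 3 carats or less
smallD <- diamonds %>%
  filter(carat <= 3)

# Same aesthetics as before, now on the filtered data
ggplot(smallD, aes(carat, price, color = cut)) +
  geom_point(alpha = 0.4, size = 2)
```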
Adjusting the scales
Instead of filtering out extreme observations, we can adjust the x axis, either by changing its limits with xlim(), or by log-transforming the values of the x scale with scale_x_log10().
Limiting the scale removes the points outside the limit range.
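A sketch of limiting the x axis. The upper limit of 3 is an assumption, chosen to match the 3-carat cutoff used for the filtered data; dd is rebuilt here so the example runs on its own.

```r
library(ggplot2)

dd <- ggplot(diamonds, aes(carat, price, color = cut)) +
  geom_point(alpha = 0.4, size = 2)

# Points with carat outside [0, 3] are dropped, with a warning
dd_limited <- dd + xlim(0, 3)
dd_limited
```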
## Warning: Removed 32 rows containing missing values (geom_point).
Limiting the x scale for the diamonds dataset created a graph that is identical to the one we created with the smallD data frame.
Log-transforming the scale keeps all the data points, but spaces the axis multiplicatively, so that equal distances represent equal ratios rather than equal differences.
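A sketch of the log-scale alternative, again rebuilding dd so the example is self-contained (dd_log is an illustrative name):

```r
library(ggplot2)

dd <- ggplot(diamonds, aes(carat, price, color = cut)) +
  geom_point(alpha = 0.4, size = 2)

# All points are kept; the x axis is drawn on a log10 scale
dd_log <- dd + scale_x_log10()
dd_log
```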
Log-transforming is useful when the data is very skewed, as in the case of the gapminder data. But for the diamonds dataset, I would probably choose to limit the axis scale.
Facets & Themes
We can further examine differences by arranging the data into subplots with facet_grid() and facet_wrap().
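A sketch of both faceting functions on the diamonds data. The choice of faceting variables (cut and color) is ours, for illustration only.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.4)

# facet_wrap(): one panel per level of a single variable, wrapped into rows
p_wrap <- p + facet_wrap(~ cut)
p_wrap

# facet_grid(): rows by one variable, columns by another
p_grid <- p + facet_grid(color ~ cut)
p_grid
```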
Bar Charts
The height of the bars in geom_bar() represents the number of cases in each group, so it only takes an “x” aesthetic.
The height of the bars in geom_col() represents other values in the data, which is why it also requires a “y” aesthetic.
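To make the difference concrete, here is a sketch that draws counts with geom_bar() and a precomputed value with geom_col(). The summary table mean_price and its column avg_price are hypothetical names introduced for this example.

```r
library(ggplot2)
library(dplyr)

# geom_bar() counts the rows in each cut group itself
p_bar <- ggplot(diamonds, aes(x = cut)) +
  geom_bar()
p_bar

# geom_col() plots a value that already exists in the data
mean_price <- diamonds %>%
  group_by(cut) %>%
  summarize(avg_price = mean(price))

p_col <- ggplot(mean_price, aes(x = cut, y = avg_price)) +
  geom_col()
p_col
```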
geom_bar()
Assigning the color aesthetic would change the color of the outlines rather than the fill of the bars. To change the color of the bars we use the fill aesthetic.
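The two variants side by side (object names are illustrative):

```r
library(ggplot2)

# color: only the bar outlines are colored
p_outline <- ggplot(diamonds, aes(x = cut, color = cut)) +
  geom_bar()
p_outline

# fill: the bars themselves are colored
p_fill <- ggplot(diamonds, aes(x = cut, fill = cut)) +
  geom_bar()
p_fill
```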
Assigning the fill to another variable splits each bar into subgroups.
The default position is set to “stack”, which is why the subgroups are stacked upon each other. The second option is position = “fill”, which stretches each bar to represent 100%. The third option is position = “dodge”, which places the groups next to each other.
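A sketch of the three position options. Mapping cut to x and clarity to fill is an assumption that follows the example code further down.

```r
library(ggplot2)

bars <- ggplot(diamonds, aes(x = cut, fill = clarity))

p_stack <- bars + geom_bar()                    # default: position = "stack"
p_full  <- bars + geom_bar(position = "fill")   # each bar scaled to 100%
p_dodge <- bars + geom_bar(position = "dodge")  # subgroups side by side

p_stack
p_full
p_dodge
```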
ggplot(diamonds, aes(x = cut, fill = clarity)) +
  geom_bar(position = "dodge") +
  facet_grid(. ~ clarity) +
  theme(axis.text.x = element_text(angle = 90))
Finally, you can also change the direction of the bars by flipping them 90 degrees with coord_flip(), or draw them in polar coordinates with coord_polar() to create a circular chart.
ggplot(diamonds, aes(x = color, fill = color)) +
  geom_bar() +
  coord_flip() +
  theme(legend.position = "none")
ggplot(diamonds, aes(x = color, fill = color)) +
  geom_bar() +
  coord_polar() +
  theme(legend.position = "none")
Boxplots
ggplot(diamonds, aes(color, price, fill = color)) +
  geom_boxplot() +
  labs(title = "My amazing diamond boxplot chart", x = "Diamond Color Grade", y = "Price")
ggplot(diamonds, aes(color, price, fill = cut)) +
  geom_boxplot() +
  labs(title = "Diamond price by color grade and cut", x = "Diamond Color Grade", y = "Price")
Line graphs
Line graphs produced by geom_line are suitable for longitudinal data in which we want to show change over time, or differences between treatments. For the diamonds data a line graph will look like a hot mess.
To demonstrate the line geom, we will switch to the gapminder data, which contains information on the life expectancy of countries at different points in time.
The Gapminder data
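As with diamonds, we start by inspecting the data. The output that follows was presumably produced by calls like these (an assumption, since the original code is not shown):

```r
library(gapminder)  # provides the gapminder data frame

str(gapminder)       # structure: type and first values of each variable
colnames(gapminder)  # variable names only
```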
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
I created a new data frame by grouping by continent and year, and adding new variables for the total population and the average life expectancy.
yearContinent <- gapminder %>%
  group_by(year, continent) %>%
  summarize(totalPop = sum(as.numeric(pop)), AverageLifeExp = mean(lifeExp))
yearContinent
## # A tibble: 60 x 4
## # Groups: year [12]
## year continent totalPop AverageLifeExp
## <int> <fct> <dbl> <dbl>
## 1 1952 Africa 237640501 39.1
## 2 1952 Americas 345152446 53.3
## 3 1952 Asia 1395357351 46.3
## 4 1952 Europe 418120846 64.4
## 5 1952 Oceania 10686006 69.3
## 6 1957 Africa 264837738 41.3
## 7 1957 Americas 386953916 56.0
## 8 1957 Asia 1562780599 49.3
## 9 1957 Europe 437890351 66.7
## 10 1957 Oceania 11941976 70.3
## # … with 50 more rows
ggplot(yearContinent, aes(x = year, y = totalPop, color = continent)) +
  geom_point() +
  geom_line() +
  expand_limits(y = 0)
Reminder: log-transforming a scale helps when the data is very skewed. So let's put everything together.
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  facet_wrap(~ year, ncol = 3)
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(~ year, ncol = 3)
gm <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10()
gm +
  facet_grid(continent ~ year) +
  theme(axis.text.x = element_text(angle = 90)) +
  theme(legend.position = "none") +
  labs(title = "Life Expectancy by GDP, Continent and Year", x = "GDP", y = "Life Expectancy")
Tutorials
Continue learning and practicing ggplot2 on your own:
- Data Visualization - in R for Data Science - Hadley Wickham’s e-book
- The Complete ggplot2 Tutorial - by Selva Prabhakaran
- Stack Overflow - great for asking the community questions
- DataCamp course - first lesson of each course is free
- Interactive charts - convert your ggplot2 figures into interactive ones powered by plotly.js